Exploratory Data Analysis

Author

Michael Love

Published

July 19, 2025

Introduction

Exploratory data analysis (EDA), including data inspection and visualization, is a key component of data analysis. Performing routine statistical analysis, such as linear modeling or statistical tests, without examining the data can result in incorrect scientific inference, both in the direction of false findings and of missed discoveries, i.e. false positives and false negatives.

In this document, we will walk through some common EDA steps as part of the analysis of a particular genomics dataset. No particular genomics knowledge is required for this lecture note, and any additional information that is useful for examining the data will be provided during the data exploration. As with any data, learning a bit about the expected relationships is useful, so feel free to consult e.g. Wikipedia or other sources.

We will start with basic data import and high-level plots, then move on to more advanced analyses such as principal components analysis (PCA), concluding with some specific genome-focused analysis, and complex manipulation of the data, organized into a rich data object. We will describe how organizing data into objects helps EDA and can help to avoid common bookkeeping errors.

Reading and looking at data

The Alasoo et al. 2018 dataset

To find meaningful patterns in data, we need to first consider a bit about how they were generated, and the experimental design.

In this document, I will work through examples from summary data associated with a particular experiment, published in 2018:

Alasoo K, Rodrigues J, Mukhopadhyay S, Knights AJ, Mann AL, Kundu K; HIPSCI Consortium; Hale C, Dougan G, Gaffney DJ. Shared genetic effects on chromatin and gene expression indicate a role for enhancer priming in immune response. Nat Genet. 2018 Mar;50(3):424-431. doi: 10.1038/s41588-018-0046-7. Epub 2018 Jan 29. PMID: 29379200; PMCID: PMC6548559.

https://pubmed.ncbi.nlm.nih.gov/29379200/

The key experiment design is described in this section:

We focussed on enhancer priming in the context of human macrophage immune response. To ensure sufficient numbers of cells, we differentiated macrophages from a panel of 123 human induced pluripotent cell lines (iPSCs) obtained from the HipSci project. We profiled gene expression (RNA-seq) and chromatin accessibility (ATAC-seq) in a subset of 86 successfully differentiated lines … in four experimental conditions: naive (N), 18 hours IFNγ stimulation (I), 5 hours Salmonella enterica serovar Typhimurium (Salmonella) infection (S), and IFNγ stimulation followed by Salmonella infection (I+S).

To restate, the authors used induced stem cell lines, differentiated these into macrophage-like cells (so capable of immune response), and then subjected the cells to three treatments (plus a control condition), in order to explore the cellular immune response. Key considerations in the experimental design here:

  1. Different human donors will have different genotypes, and so can be used to study genetic effects on immune response.

  2. Interferon gamma (IFNg or IFNγ) stimulation primes the immune system to respond to Salmonella infection, and so may show a different effect than each treatment alone.

In the summary data we will work with, we do not have genotype information, so we will ignore this aspect of the data (the paper is very interesting and focuses on this, so do read further if you are interested). The full data is available but requires a formal application to be submitted for data use.

A bit of omics background

Overview: We will look at data that summarizes the RNA-seq and ATAC-seq, in cell lines from diverse human donors, across four experimental conditions. If these experiments are not familiar, all we need to know is that RNA-seq measures the expression of genes, and ATAC-seq measures the accessibility of chromatin across the genome. They use DNA sequencing machines to make these measurements, but that detail is not critical right now. Accessibility is another word for the chromatin being less compact, such that the DNA is more open and available to DNA-binding proteins. Compact chromatin is less accessible for DNA-binding proteins.

Motivation: We are interested in measuring the average accessibility of chromatin across cells in the sample, as these DNA-binding proteins often mediate cellular response by binding to specific DNA sequences and influencing transcription of genes on the same chromosome (referred to as cis in genetics, for occurring on the same molecule) and “nearby” in the linear molecule. These DNA-binding proteins could be called regulatory proteins, or “transcription factors”, and the piece of DNA that they bind to could be called a regulatory element, or more specifically a cis-regulatory element (CRE). For our purposes, “nearby” on the chromosome may mean 10–100 thousand base pairs in either direction from the location where the protein binds. For help interpreting this paragraph, take a look at panels a and b of Figure 1 in the published article.

Data interpretation: The data will consist of counts organized into a matrix. The arrangement of the matrix is described below, but I first describe, at a high level, the meaning of the counts that appear within. A count in the matrix is an observation of a particular region of DNA being accessible in a particular sample. The measurements are relative, in that the total count for one sample across all the regions is essentially arbitrary. Within a sample, a zero or a low count means that the region was less accessible than a region with a higher count.
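Because only within-sample comparisons are meaningful, one simple way to put samples on a common scale is to divide each column by its total. A toy sketch (made-up numbers, not the real data; real analyses use more careful normalization, but the principle is the same):

```r
# Toy count matrix: 3 regions (rows) x 2 samples (columns), made-up numbers
counts_toy <- matrix(c(10, 40, 50,
                       20, 80, 100),
                     nrow = 3,
                     dimnames = list(paste0("peak", 1:3), c("s1", "s2")))
colSums(counts_toy)   # totals differ: 100 vs 200
# Dividing each column by its total shows the two samples have the same
# relative accessibility per region (0.1, 0.4, 0.5):
props <- sweep(counts_toy, 2, colSums(counts_toy), "/")
```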

File descriptions

Summary versions of the expression and chromatin accessibility data are available here:

https://zenodo.org/records/1188300

I have downloaded a few of these files and made small versions of them in this git repository. I subset the chromatin accessibility (ATAC-seq) data to the first 1000 rows. In this data a row corresponds to a region of the genome (a potential regulatory element). Columns correspond to samples (human cell lines in one of the treatment conditions). There are three important files:

  1. the count matrix describing accessibility
  2. the genome regions being described (rows of the matrix). these happen to be called “peaks”
  3. the samples being assayed (columns of the matrix)

Reading in data

Let’s start by reading in the counts matrix. Here I use the basic read.delim() function to read in a matrix of data, and then convert to something called a “tibble”. Two packages used here:

  1. tibble for storing the information
  2. here for referring to local files
library(tibble)
library(here)
# this file has no column name for first column (these are e.g. row names)
counts <- read.delim(here("labs/exploratory/counts.txt.gz"))
counts <- counts |> 
  rownames_to_column("peak") |>
  as_tibble()
counts
# A tibble: 1,000 × 146
   peak  bima_A_ATAC bima_B_ATAC bima_C_ATAC bima_D_ATAC eofe_A_ATAC eofe_B_ATAC
   <chr>       <int>       <int>       <int>       <int>       <int>       <int>
 1 ATAC…         283         278         325         282         303         436
 2 ATAC…          24         216          44          75          42          77
 3 ATAC…          33          54          36          23          20          23
 4 ATAC…          44          44          65          36          42          25
 5 ATAC…          24          23          17          10          13           9
 6 ATAC…          32          32          33          19          31          15
 7 ATAC…         370         560         537         431         327         380
 8 ATAC…          26          20          23          11          15          17
 9 ATAC…          27          24          22          17          44          29
10 ATAC…          11          29          54          43          13          57
# ℹ 990 more rows
# ℹ 139 more variables: eofe_C_ATAC <int>, eofe_D_ATAC <int>,
#   eika_B_ATAC <int>, eika_A_ATAC <int>, ieki_B_ATAC <int>, ieki_A_ATAC <int>,
#   nusw_B_ATAC <int>, nusw_A_ATAC <int>, ruyv_B_ATAC <int>, ruyv_A_ATAC <int>,
#   uimo_B_ATAC <int>, uimo_A_ATAC <int>, cicb_B_ATAC <int>, cicb_A_ATAC <int>,
#   cicb_D_ATAC <int>, cicb_C_ATAC <int>, vass_B_ATAC <int>, vass_A_ATAC <int>,
#   vass_D_ATAC <int>, vass_C_ATAC <int>, oefg_B_ATAC <int>, …

The last line prints the tibble to the console which shows the first 10 rows in this case, and then states the remaining number of rows. We see a selection of the columns, which are samples.

Some brief terminology: “peaks” refers to regions of the genome. They are called “peaks” because the raw data from the experiment look like a mountain peak when visualized, as the observations of openness stack on top of each other. We will continue with the domain-specific word “peak”, but when you see this word you can just think “region of the genome”.

A quick note on tibble vs. data.frame: these are often interchangeable. Tibbles are commonly used with packages such as dplyr, and have more convenient printing in my opinion. Another detail is that a tibble cannot have row names, only column names.

test_dat <- tibble(foo = c("a","b","c"), bar = c(-1.1, 2.0, 10.56))
test_dat
# A tibble: 3 × 2
  foo     bar
  <chr> <dbl>
1 a      -1.1
2 b       2  
3 c      10.6
test_dat |> as.data.frame()
  foo   bar
1   a -1.10
2   b  2.00
3   c 10.56
Next, we read in the peak information using read_delim() from the readr package:

library(readr)
peaks <- read_delim(here("labs/exploratory/peaks.txt.gz"))
Rows: 1000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): gene_id, strand, gene_name
dbl (6): length, percentage_gc_content, chr, start, end, score

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
peaks
# A tibble: 1,000 × 9
   gene_id length percentage_gc_content   chr start   end strand score gene_name
   <chr>    <dbl>                 <dbl> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>    
 1 ATAC_p…    690                 0.555     1  9979 10668 +       1000 ATAC_pea…
 2 ATAC_p…    535                 0.675     1 10939 11473 +       1000 ATAC_pea…
 3 ATAC_p…    225                 0.618     1 15505 15729 +       1000 ATAC_pea…
 4 ATAC_p…    334                 0.572     1 21148 21481 +       1000 ATAC_pea…
 5 ATAC_p…    204                 0.559     1 21864 22067 +       1000 ATAC_pea…
 6 ATAC_p…    149                 0.544     1 26026 26174 +       1000 ATAC_pea…
 7 ATAC_p…    822                 0.718     1 28665 29486 +       1000 ATAC_pea…
 8 ATAC_p…    105                 0.676     1 29745 29849 +       1000 ATAC_pea…
 9 ATAC_p…    159                 0.509     1 29935 30093 +       1000 ATAC_pea…
10 ATAC_p…    328                 0.415     1 33060 33387 +       1000 ATAC_pea…
# ℹ 990 more rows

And the sample information, in a file referring to “column data”, because the columns of the matrix are samples:

coldata <- read_delim(here("labs/exploratory/coldata.txt.gz"))
Rows: 145 Columns: 47
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (19): sample_id, donor, condition_char, condition_name, SL1344, IFNg, i...
dbl  (18): multiplex_pool, clone, replicate, passage_received, passage_diff,...
date (10): stimulation_date, submission_date, ips_received, ips_started, EB_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
coldata
# A tibble: 145 × 47
   sample_id   donor condition_char condition_name SL1344 IFNg  stimulation_date
   <chr>       <chr> <chr>          <chr>          <chr>  <chr> <date>          
 1 bima_A_ATAC bima  A              naive          contr… naive 2014-12-12      
 2 bima_B_ATAC bima  B              IFNg           contr… prim… 2014-12-12      
 3 bima_C_ATAC bima  C              SL1344         infec… naive 2014-12-12      
 4 bima_D_ATAC bima  D              IFNg_SL1344    infec… prim… 2014-12-12      
 5 eofe_A_ATAC eofe  A              naive          contr… naive 2014-12-12      
 6 eofe_B_ATAC eofe  B              IFNg           contr… prim… 2014-12-12      
 7 eofe_C_ATAC eofe  C              SL1344         infec… naive 2014-12-12      
 8 eofe_D_ATAC eofe  D              IFNg_SL1344    infec… prim… 2014-12-12      
 9 eika_B_ATAC eika  B              IFNg           contr… prim… 2015-04-17      
10 eika_A_ATAC eika  A              naive          contr… naive 2015-04-17      
# ℹ 135 more rows
# ℹ 40 more variables: submission_date <date>, i5_tag <chr>, i5_sequence <chr>,
#   i7_tag <chr>, i7_sequence <chr>, nextera_tag <chr>, multiplex_pool <dbl>,
#   bioanalyzer <chr>, nextera_kit <chr>, line_id <chr>, clone <dbl>,
#   replicate <dbl>, ips_received <date>, ips_started <date>,
#   passage_received <dbl>, received_as <chr>, media <chr>,
#   EB_formation <date>, passage_diff <dbl>, diff_start <date>, …
Question

How many experiments do you spot per cell line donor?

We now have three pieces of data in our R session, but each alone doesn’t tell us that much. We will see as we step through the analysis that it is more convenient to tie these tables together.
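As a small preview of tying tables together, dplyr’s left_join() matches rows by a shared key rather than relying on row order. A toy sketch (made-up tables, not the ones above):

```r
library(dplyr)
library(tibble)

# Toy sample table and toy per-sample summary, in different row orders
col_info <- tibble(sample_id = c("s1", "s2", "s3"),
                   condition = c("naive", "IFNg", "naive"))
sums <- tibble(sample_id = c("s3", "s1", "s2"),
               col_sum = c(30, 10, 20))
# left_join matches on sample_id, so the order mismatch is handled for us
joined <- left_join(col_info, sums, by = "sample_id")
joined$col_sum
# [1] 10 20 30
```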

Quick summaries

We can use the glimpse() function from the dplyr package for quick summaries of the dataset. dplyr also provides many useful utilities for subsetting and manipulating tabular data, which we will see later.

library(dplyr)
dplyr::glimpse(peaks)
Rows: 1,000
Columns: 9
$ gene_id               <chr> "ATAC_peak_1", "ATAC_peak_2", "ATAC_peak_3", "AT…
$ length                <dbl> 690, 535, 225, 334, 204, 149, 822, 105, 159, 328…
$ percentage_gc_content <dbl> 0.555072, 0.674766, 0.617778, 0.571856, 0.558824…
$ chr                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ start                 <dbl> 9979, 10939, 15505, 21148, 21864, 26026, 28665, …
$ end                   <dbl> 10668, 11473, 15729, 21481, 22067, 26174, 29486,…
$ strand                <chr> "+", "+", "+", "+", "+", "+", "+", "+", "+", "+"…
$ score                 <dbl> 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, …
$ gene_name             <chr> "ATAC_peak_1", "ATAC_peak_2", "ATAC_peak_3", "AT…

Note that the first column “gene_id” is a misnomer (wrong name), as we are not looking at genes. This sometimes happens in genomics, when a file format is re-used for other types of data.

dplyr::glimpse(coldata)
Rows: 145
Columns: 47
$ sample_id            <chr> "bima_A_ATAC", "bima_B_ATAC", "bima_C_ATAC", "bim…
$ donor                <chr> "bima", "bima", "bima", "bima", "eofe", "eofe", "…
$ condition_char       <chr> "A", "B", "C", "D", "A", "B", "C", "D", "B", "A",…
$ condition_name       <chr> "naive", "IFNg", "SL1344", "IFNg_SL1344", "naive"…
$ SL1344               <chr> "control", "control", "infected", "infected", "co…
$ IFNg                 <chr> "naive", "primed", "naive", "primed", "naive", "p…
$ stimulation_date     <date> 2014-12-12, 2014-12-12, 2014-12-12, 2014-12-12, …
$ submission_date      <date> 2015-01-20, 2015-01-20, 2015-01-20, 2015-01-20, …
$ i5_tag               <chr> "N502", "N502", "N502", "N502", "N502", "N502", "…
$ i5_sequence          <chr> "CTCTCTAT", "CTCTCTAT", "CTCTCTAT", "CTCTCTAT", "…
$ i7_tag               <chr> "N702", "N701", "N704", "N703", "N706", "N705", "…
$ i7_sequence          <chr> "CGTACTAG", "TAAGGCGA", "TCCTGAGC", "AGGCAGAA", "…
$ nextera_tag          <chr> "Nextera tag_14", "Nextera tag_13", "Nextera tag_…
$ multiplex_pool       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ bioanalyzer          <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", …
$ nextera_kit          <chr> "kit_24_1", "kit_24_1", "kit_24_1", "kit_24_1", "…
$ line_id              <chr> "bima_1", "bima_1", "bima_1", "bima_1", "eofe_1",…
$ clone                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1…
$ replicate            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ ips_received         <date> 2014-10-24, 2014-10-24, 2014-10-24, 2014-10-24, …
$ ips_started          <date> 2014-10-24, 2014-10-24, 2014-10-24, 2014-10-24, …
$ passage_received     <dbl> 40, 40, 40, 40, 35, 35, 35, 35, 28, 28, 38, 38, 2…
$ received_as          <chr> "alive", "alive", "alive", "alive", "alive", "ali…
$ media                <chr> "E8", "E8", "E8", "E8", "KSR", "KSR", "KSR", "KSR…
$ EB_formation         <date> 2014-11-04, 2014-11-04, 2014-11-04, 2014-11-04, …
$ passage_diff         <dbl> 42, 42, 42, 42, 36, 36, 36, 36, 32, 32, 42, 42, 3…
$ diff_start           <date> 2014-11-07, 2014-11-07, 2014-11-07, 2014-11-07, …
$ salmonella_date      <date> 2014-12-12, 2014-12-12, 2014-12-12, 2014-12-12, …
$ atac_date            <date> 2014-12-12, 2014-12-12, 2014-12-12, 2014-12-12, …
$ flow_date            <date> 2014-12-16, 2014-12-16, 2014-12-16, 2014-12-16, …
$ medium_changes       <dbl> 6, 6, 6, 6, 8, 8, 8, 8, 9, 9, 11, 11, 9, 9, 9, 9,…
$ terminated           <date> 2014-12-19, 2014-12-19, 2014-12-19, 2014-12-19, …
$ status               <chr> "Success", "Success", "Success", "Success", "Succ…
$ mean_purity          <dbl> 0.9850139, 0.9850139, 0.9850139, 0.9850139, 0.977…
$ max_purity           <dbl> 0.9889400, 0.9889400, 0.9889400, 0.9889400, 0.979…
$ genotype_id          <chr> "HPSI1113i-bima_1", "HPSI1113i-bima_1", "HPSI1113…
$ sex                  <chr> "male", "male", "male", "male", "female", "female…
$ ips_culture_days     <dbl> 11, 11, 11, 11, 5, 5, 5, 5, 26, 26, 23, 23, 18, 1…
$ passage_diff_bins    <dbl> 40, 40, 40, 40, 30, 30, 30, 30, 30, 30, 40, 40, 3…
$ max_purity_filtered  <dbl> 0.9889400, 0.9889400, 0.9889400, 0.9889400, 0.979…
$ mean_purity_filtered <dbl> 0.9850139, 0.9850139, 0.9850139, 0.9850139, 0.977…
$ assigned             <dbl> 20213226, 27158360, 29720460, 23275756, 21807156,…
$ assigned_frac        <dbl> 0.7110973, 0.7276358, 0.6838026, 0.6774437, 0.666…
$ mt_frac              <dbl> 0.22312846, 0.15385990, 0.14644551, 0.11751974, 0…
$ percent_duplication  <dbl> 0.086860, 0.106955, 0.094512, 0.079756, 0.090006,…
$ short_long_ratio     <dbl> 2.114210, 2.213082, 2.532606, 2.913072, 2.105560,…
$ peak_count           <dbl> 180460, 172172, 208337, 181080, 170418, 173927, 1…

Another useful package for summarizing data is skimr with the skim() function:

library(skimr)
skim(peaks)
Data summary
Name peaks
Number of rows 1000
Number of columns 9
_______________________
Column type frequency:
character 3
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gene_id 0 1 11 14 0 1000 0
strand 0 1 1 1 0 1 0
gene_name 0 1 11 14 0 1000 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
length 0 1 306.00 223.28 52.00 167.00 232.00 367.50 1529.00 ▇▂▁▁▁
percentage_gc_content 0 1 0.59 0.11 0.31 0.51 0.59 0.67 0.88 ▁▆▇▅▁
chr 0 1 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁
start 0 1 3984249.02 2660147.44 9979.00 1553756.25 3502116.50 6493686.00 8373970.00 ▇▆▃▅▆
end 0 1 3984554.02 2660146.36 10668.00 1553961.00 3502247.00 6493817.00 8374329.00 ▇▆▃▅▆
score 0 1 1000.00 0.00 1000.00 1000.00 1000.00 1000.00 1000.00 ▁▁▇▁▁
Question

Scroll to the right in the numeric variable table. How would you describe the distribution of the different numeric columns? Do these distributions make sense? Note that our subset of the data contains just the first 1,000 regions on chromosome 1.

We will also run skim on the column data. We do this twice: once on the data after de-selecting the columns assigned and peak_count, and once on just those two columns. This is because the scale of those two columns is much larger than the other numeric columns, so it helps with printing to isolate them.

coldata |> select(!c(assigned, peak_count)) |> skim()
Data summary
Name select(coldata, !c(assign…
Number of rows 145
Number of columns 45
_______________________
Column type frequency:
character 19
Date 10
numeric 16
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
sample_id 0 1.0 11 11 0 145 0
donor 0 1.0 4 4 0 44 0
condition_char 0 1.0 1 1 0 4 0
condition_name 0 1.0 4 11 0 4 0
SL1344 0 1.0 7 8 0 2 0
IFNg 0 1.0 5 6 0 2 0
i5_tag 0 1.0 4 4 0 8 0
i5_sequence 0 1.0 8 8 0 8 0
i7_tag 0 1.0 4 4 0 12 0
i7_sequence 0 1.0 8 8 0 12 0
nextera_tag 0 1.0 13 14 0 109 0
bioanalyzer 14 0.9 3 3 0 1 0
nextera_kit 0 1.0 6 8 0 5 0
line_id 0 1.0 6 6 0 44 0
received_as 0 1.0 5 12 0 3 0
media 0 1.0 2 3 0 2 0
status 0 1.0 7 7 0 1 0
genotype_id 0 1.0 16 16 0 44 0
sex 0 1.0 4 6 0 2 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
stimulation_date 0 1.00 2014-12-12 2015-12-04 2015-08-04 18
submission_date 0 1.00 2015-01-20 2015-12-09 2015-08-24 4
ips_received 0 1.00 2014-10-10 2015-09-02 2015-01-14 5
ips_started 0 1.00 2014-10-24 2015-09-15 2015-06-11 18
EB_formation 0 1.00 2014-10-29 2015-10-20 2015-06-28 36
diff_start 0 1.00 2014-11-01 2015-10-23 2015-07-01 36
salmonella_date 2 0.99 2014-12-12 2015-12-04 2015-08-04 18
atac_date 0 1.00 2014-12-12 2015-12-04 2015-08-04 17
flow_date 4 0.97 2014-12-04 2015-11-27 2015-08-05 17
terminated 0 1.00 2014-12-19 2015-12-04 2015-08-08 21

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
multiplex_pool 0 1.00 2.91 1.24 0.00 2.00 3.00 4.00 4.00 ▁▂▂▅▇
clone 0 1.00 1.95 1.14 1.00 1.00 2.00 3.00 6.00 ▇▃▁▁▁
replicate 0 1.00 1.03 0.16 1.00 1.00 1.00 1.00 2.00 ▇▁▁▁▁
passage_received 0 1.00 32.21 6.53 16.00 28.00 33.00 37.00 44.00 ▁▅▅▇▅
passage_diff 0 1.00 36.41 6.48 20.00 32.00 37.00 41.00 48.00 ▁▅▆▇▃
medium_changes 0 1.00 7.23 1.47 5.00 6.00 7.00 8.00 11.00 ▇▆▅▃▂
mean_purity 4 0.97 0.97 0.02 0.89 0.96 0.98 0.99 1.00 ▁▁▂▃▇
max_purity 4 0.97 0.98 0.02 0.90 0.97 0.98 0.99 1.00 ▁▁▁▃▇
ips_culture_days 0 1.00 20.30 6.53 5.00 16.00 19.00 23.00 41.00 ▂▇▆▂▁
passage_diff_bins 0 1.00 32.00 6.62 20.00 30.00 30.00 40.00 40.00 ▂▁▇▁▅
max_purity_filtered 4 0.97 0.98 0.02 0.90 0.97 0.98 0.99 1.00 ▁▁▁▃▇
mean_purity_filtered 4 0.97 0.97 0.02 0.89 0.96 0.98 0.99 1.00 ▁▁▂▃▇
assigned_frac 0 1.00 0.54 0.13 0.19 0.46 0.58 0.65 0.76 ▂▃▆▇▆
mt_frac 0 1.00 0.30 0.14 0.07 0.19 0.28 0.40 0.79 ▇▇▅▂▁
percent_duplication 0 1.00 0.09 0.03 0.05 0.08 0.09 0.11 0.21 ▅▇▂▁▁
short_long_ratio 0 1.00 2.11 1.03 0.91 1.47 1.82 2.42 6.96 ▇▂▁▁▁

Note that the meaning of the columns is described in a metadata file at the Zenodo link.

Question

The assigned_frac column is described as the “fraction of non-mitochondrial reads assigned to consensus peaks”. A “read” is an observation of accessibility in this context. What is a typical value? For a typical sample, are most of the observations within the regions called “peaks”?

coldata |> select(assigned, peak_count) |> skim()
Data summary
Name select(coldata, assigned,…
Number of rows 145
Number of columns 2
_______________________
Column type frequency:
numeric 2
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
assigned 0 1 15385402.5 5507593.4 3964874 11189997 15141993 19250016 31712153 ▃▇▇▃▁
peak_count 0 1 139919.5 34779.3 37015 116645 147211 164249 208337 ▁▂▆▇▃
Question

What is the typical number of peaks for a sample? Is the distribution of the number of peaks left- or right-skewed? The assigned column gives the “total number of paired-end fragments overlapping peaks”, that is, the number of observations that fall within the regions called “peaks”. What is a typical value here? Recall that the total number of observations per sample is arbitrary. This assigned number is influenced both by the total number of observations and by the enrichment of observations in the “peak” regions.

Note that we can also select by pattern, see ?select for more details:

coldata |> select(ends_with("_date"))
# A tibble: 145 × 5
   stimulation_date submission_date salmonella_date atac_date  flow_date 
   <date>           <date>          <date>          <date>     <date>    
 1 2014-12-12       2015-01-20      2014-12-12      2014-12-12 2014-12-16
 2 2014-12-12       2015-01-20      2014-12-12      2014-12-12 2014-12-16
 3 2014-12-12       2015-01-20      2014-12-12      2014-12-12 2014-12-16
 4 2014-12-12       2015-01-20      2014-12-12      2014-12-12 2014-12-16
 5 2014-12-12       2015-01-20      2014-12-12      2014-12-12 2014-12-04
 6 2014-12-12       2015-01-20      2014-12-12      2014-12-12 2014-12-04
 7 2014-12-12       2015-01-20      2014-12-12      2014-12-12 2014-12-04
 8 2014-12-12       2015-01-20      2014-12-12      2014-12-12 2014-12-04
 9 2015-04-17       2015-05-20      2015-04-14      2015-04-17 2015-04-13
10 2015-04-17       2015-05-20      2015-04-14      2015-04-17 2015-04-13
# ℹ 135 more rows

Cumulative densities, histograms, ggplot

Above we saw small text-based histograms of numeric data, which allow us to quickly look at the distribution of values. Here I explore that further, first introducing the ecdf() function. An empirical cumulative distribution function (ECDF) shows similar information to a histogram, as if the bars of the histogram were stacked on top of each other from left to right. But instead of breaking the range of the data on the x-axis into bins, each data point is represented. The value on the y-axis goes from 0 at the left edge of the data (minimum) to 1 at the right edge of the data (maximum). Reading the x-value where the y-value equals 0.5 gives the median.
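Before applying this to the dataset, a quick sketch with simulated values illustrates the median property of the ECDF:

```r
set.seed(1)
x <- rnorm(1001, mean = 5)   # simulated values, not the dataset
Fn <- ecdf(x)                # Fn is a function: Fn(t) = fraction of x <= t
Fn(median(x))                # ~0.5, by definition of the median
plot(Fn)
abline(h = 0.5, v = median(x), lty = 2)   # read off the median where y = 0.5
```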

Let’s make some ECDF plots for the 1,000 rows of the count data. For example, the column sums (in thousands):

dim(counts)
[1] 1000  146
counts[1:5,1:5]
# A tibble: 5 × 5
  peak        bima_A_ATAC bima_B_ATAC bima_C_ATAC bima_D_ATAC
  <chr>             <int>       <int>       <int>       <int>
1 ATAC_peak_1         283         278         325         282
2 ATAC_peak_2          24         216          44          75
3 ATAC_peak_3          33          54          36          23
4 ATAC_peak_4          44          44          65          36
5 ATAC_peak_5          24          23          17          10
col_sum <- counts |> select(-peak) |> colSums()
plot( ecdf(col_sum/1e3) )

Question

What is a typical value for the column sum of our counts matrix?

Let’s also look at row sums (in thousands). These are the total number of observations in a region of the genome across all samples.

row_sum <- counts |> select(-peak) |> rowSums()
plot( ecdf(row_sum/1e3) )

Question

What do you notice about the row sum in contrast to the column sum? Any outliers?

We can focus on the area where most of the rows are:

plot( ecdf(row_sum/1e3), xlim=c(0,10) )

Question

What’s a typical value for the row sum?

Another powerful package for making plots is ggplot2. This package is widely used and has many tutorials online. The “gg” stands for Grammar of Graphics, which refers to the concept of separating the specification of plot elements from the mapping of variables to visual aesthetics. I find the best way to learn is by trying it out.

The ggplot() function requires input in the form of a tibble or data.frame. This is because it uses the names of the columns (variables) when mapping them to aesthetics.

In order to plot a histogram of the column sums, we first need to put this variable into a tibble, which is done on the first line below.

Then we call ggplot(), giving it the tibble, and in the second argument we map the column sum in thousands to the x aesthetic, within the function aes().

Finally we specify a “geometry” for the data, in this case a histogram. In ggplot2 we combine these operations with +.

library(ggplot2)
dat <- tibble(col_sum_thousands = col_sum / 1e3)
ggplot(
  data = dat,
  mapping = aes(x = col_sum_thousands)
) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can make a more interesting plot from the column data. Note that it is common, and allowed, to leave off the names of the first two arguments of ggplot(), data and mapping, which we do here. We then specify a point geometry, which produces a scatterplot.

ggplot(
  coldata,
  aes(x = assigned, y = assigned_frac, color = condition_name)
) +
  geom_point()

Question

Check out the possible geoms at this link. Try different plots for the various character and numeric data columns in coldata.

Additional plot elements can be added with +, building up to complex figures that are suitable for manuscripts. A few lines of code produce a scatterplot with separate regression lines for each group.

ggplot(
  coldata,
  aes(x = assigned, y = assigned_frac, color = condition_name)
) +
  geom_point() +
  stat_smooth(
    method = "lm",
    formula = "y ~ x",
    se = FALSE
  )

Question

Advanced: try using a different smoothing method, from ?stat_smooth.

Visualizing matrix data

Matrix data, where each column provides numeric values on the same scale, have specialized plots that help us understand the structure of the data in the matrix. It often helps to look at transformed data, especially with counts. Below we take the logarithm (base 10) of the data plus a pseudocount of 1. While we could do this operation simply with log10(matrix + 1), here we use a dplyr approach, keeping the data in the tibble format.

log_counts <- counts |>
  select(-peak) |>
  mutate(across(everything(), ~ log10(.x + 1)))
log_counts_6 <- log_counts |> select(1:6)

Alternatively, we could have used more basic R code…

counts_subset <- as.matrix(counts[,2:7])
head( log10(counts_subset + 1) )
     bima_A_ATAC bima_B_ATAC bima_C_ATAC bima_D_ATAC eofe_A_ATAC eofe_B_ATAC
[1,]    2.453318    2.445604    2.513218    2.451786    2.482874    2.640481
[2,]    1.397940    2.336460    1.653213    1.880814    1.633468    1.892095
[3,]    1.531479    1.740363    1.568202    1.380211    1.322219    1.380211
[4,]    1.653213    1.653213    1.819544    1.568202    1.633468    1.414973
[5,]    1.397940    1.380211    1.255273    1.041393    1.146128    1.000000
[6,]    1.518514    1.518514    1.531479    1.301030    1.505150    1.204120

The package DataExplorer offers paneled plots across columns of matrix data. For example, to scan across a grid of histograms:

library(DataExplorer)
plot_histogram(log_counts_6, ncol = 3)

This is a really useful paradigm for checking for samples with different distributions.

What if we have very many samples? We could break the samples into groups of say 12 (4 x 3) and make a PDF with many pages of histograms.
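A minimal base-R sketch of that idea (with simulated stand-in data so it runs on its own; with the real data, you would use log_counts from above instead of sim_log_counts):

```r
# Simulated stand-in: 30 "samples" of positive values
sim_log_counts <- as.data.frame(
  matrix(rlnorm(1000 * 30), ncol = 30,
         dimnames = list(NULL, paste0("sample_", 1:30)))
)
# 12 histograms (4 x 3) per page; new pages start automatically as needed
pdf("histograms.pdf", width = 8, height = 10)
par(mfrow = c(4, 3))
for (s in names(sim_log_counts)) {
  hist(sim_log_counts[[s]], main = s, xlab = "value")
}
dev.off()
```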

An alternative is to use ggplot2’s density geom to draw a line for each sample. But to do so, we must arrange the matrix data into a long table, using the tidyr package and the function pivot_longer(). This stacks the matrix columns into a single column, with an additional column specifying the sample name.

Here we will use color to identify sample, but we may have to drop this if we had hundreds of samples.

library(tidyr)
long_log_counts <- log_counts_6 |>
  pivot_longer(
    cols = everything(),
    names_to = "sample",
    values_to = "log_counts"
  )
ggplot(
  long_log_counts,
  aes(x = log_counts, color = sample)
) +
  geom_density()

Another option would be boxplots or violin plots.

ggplot(
  long_log_counts,
  aes(x = sample, y = log_counts)
) +
  geom_boxplot()

ggplot(
  long_log_counts,
  aes(x = sample, y = log_counts)
) +
  geom_violin() +
  stat_summary(fun = median, geom = "point", size=3)

The “QQ-plot” aligns the quantiles of the data with those of a Normal distribution (or another reference distribution). If the values fall roughly on a line, this means the data is roughly normally distributed.

plot_qq(log_counts_6, ncol = 3)

We can also make a matrix of correlations:

plot_correlation(log_counts_6)

There are a number of packages that provide more fine-grained correlation matrices, such as corrgram.

Question
  • What could you say, in broad terms, about the first six samples in terms of these plots?
  • For those who are more familiar with omics data such as gene expression and chromatin accessibility: suppose we also read in the gene expression data as a matrix. What might be a way to check that our data is properly aligned by sample across the two assays? What part of a gene must be accessible for it to be expressed? What steps would it take to perform this check?

Wrangling data into objects

Data wrangling is “the process of transforming and mapping data from one raw data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics” (from Wikipedia).

For the dataset we have been exploring, we haven’t yet made much use of the annotation of the rows (genome regions) or the columns (samples). We have three separate tables tracking this information, but we don’t even know if the tables are properly aligned (e.g. row 100 in the peaks could be row 900 in the counts data for all we know, likewise for coldata and columns of counts). To clean up our code, and prevent potential errors, it makes sense to formally organize our data and metadata across these multiple tables.

First we can do some checks, to make sure samples are present in both tables, and are in the same order. If these are not in order, we will have to re-arrange later.

table(coldata$sample_id %in% colnames(counts)[-1])

TRUE 
 145 
table(colnames(counts)[-1] %in% coldata$sample_id)

TRUE 
 145 
# not in order! so we have to be careful below
all.equal(coldata$sample_id, colnames(counts)[-1])
[1] "63 string mismatches"

Again for the rows of the counts, i.e. the genomic regions:

table(peaks$gene_id %in% counts$peak)

TRUE 
1000 
table(counts$peak %in% peaks$gene_id)

TRUE 
1000 
all.equal(peaks$gene_id, counts$peak)
[1] TRUE

“Object-oriented programming” helps with bookkeeping. The idea is to assemble our tables into a single “object”, which has specific slots and rules for coordinating operations such as re-ordering or subsetting across the tables. Pulling information out of the object (“getting”) and putting information into it (“setting”) are handled by “accessor” and “assignment” functions, which look like foo(object) and foo(object) <- new_value, respectively.
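Base R already uses this getter/setter pattern: names() is an accessor, and names()<- is its matching assignment function. The Bioconductor objects below follow the same interface. A minimal illustration:

```r
v <- c(a = 1, b = 2, c = 3)
names(v)                      # accessor ("getter"): pulls information out
# [1] "a" "b" "c"
names(v) <- c("x", "y", "z")  # assignment ("setter"): puts information in
names(v)
# [1] "x" "y" "z"
```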

Below we will create an object to combine and coordinate the counts (with rows as genomic regions and columns as samples) with the sample information (coldata). We can also add data about the rows (peaks).

Principal component analysis

Principal component analysis (PCA) is a highly useful EDA technique, which rotates our dataset into a new coordinate system. It’s helpful for plotting, diagnostics, machine learning, bias removal, and many more data analytic tasks. We will use it on our chromatin accessibility data to explore how the samples differ in the counts over the regions.

As we have a table with samples as columns, we will be rotating our data from the original space of p features (the number of genomic regions) into a new space of p features, where the first features are the most “important”. Mathematically, the first rotated feature has the most variance, defined as \(\frac{1}{n} \sum_i (x_i - \bar{x})^2\), and so on for the second, third, etc. We can also say that the top k rotated features are the best at preserving distances in the entire space. So PCA is an efficient way to compress information into a set of ranked features.
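We can verify this variance-ranking property on simulated data: the variances of the rotated features (pca$sdev^2) are non-increasing, and the first rotated feature has at least as much variance as any single original feature. (The matrix and names here are illustrative, not from the dataset.)

```r
set.seed(1)
# 5 simulated features with increasing spread: column j has standard deviation j
demo_mat <- matrix(rnorm(100 * 5, sd = rep(1:5, each = 100)), ncol = 5)
demo_pca <- prcomp(demo_mat)  # centers the columns by default
variances <- demo_pca$sdev^2  # variance of each rotated feature
all(diff(variances) <= 0)     # TRUE: variances are ranked, PC1 has the most
variances[1] >= max(apply(demo_mat, 2, var))  # TRUE: PC1 beats any raw feature
```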

A technical detail before we run PCA: each column of our matrix has an arbitrary column sum, as we saw before. Column sums mostly reflect technical differences such as sequencing depth rather than chromatin accessibility, and so they can hide important biological relationships.

plot(colSums(log_counts))

Without getting into too much detail, a decent approach is to simply scale the columns by some value. Here we will scale them by their median value.

scaled_counts <- counts |> 
  select(-peak) |>
  mutate(across(everything(),  ~ .x / median(.x))) |>
  as.matrix()

Now we compute PCA on the samples. We need to transpose with t() to make samples into rows. We also take the logarithm (base 10) of the counts, with a pseudocount of 1. There are many considerations about which transformation to use, if any, but something like a logarithm is useful for seeing patterns in count data.

We can immediately make a plot of the data in the rotated space. A 2D scatterplot shows us PC1 and PC2.

pca <- prcomp(t(log10(scaled_counts + 1)))
plot(pca$x[,1:2])

We can ask, what original features are contributing substantially to these top rotated dimensions?

rot <- pca$rotation[,1:4] |> as_tibble()
rot |> 
  arrange(desc(abs(PC1))) |>
  head()
# A tibble: 6 × 4
     PC1      PC2      PC3      PC4
   <dbl>    <dbl>    <dbl>    <dbl>
1  0.173 -0.0665  -0.0134  -0.0321 
2  0.140 -0.0459  -0.0351  -0.00626
3  0.140 -0.0274  -0.0300  -0.0147 
4  0.128 -0.0310  -0.00752 -0.0504 
5 -0.123  0.00492  0.00451  0.0274 
6  0.122  0.0451  -0.0215  -0.0175 

And again for PC2:

rot |> 
  arrange(desc(abs(PC2))) |>
  head()
# A tibble: 6 × 4
       PC1   PC2        PC3    PC4
     <dbl> <dbl>      <dbl>  <dbl>
1  0.0596  0.162  0.00627   0.0598
2  0.0509  0.130 -0.0214    0.0582
3  0.0454  0.117  0.0117    0.0478
4 -0.00815 0.112  0.0108    0.0632
5  0.0629  0.108 -0.0126    0.0626
6  0.0442  0.105 -0.0000592 0.0382

However, this analysis would be much more valuable if we had sample and feature metadata alongside. Let’s arrange the coldata so it matches the columns of counts:

coldata_arr <- coldata |>
  arrange(match(sample_id, colnames(counts)[-1]))

Now we can put the counts and its column data together into an object. We can use the SummarizedExperiment package to build an object of the same name. This will tie together the two tables and facilitate data manipulation and plotting.

library(SummarizedExperiment)
se <- SummarizedExperiment(
  list(counts = counts |> select(-1) |> as.matrix()),
  colData = coldata_arr
)
se
class: SummarizedExperiment 
dim: 1000 145 
metadata(0):
assays(1): counts
rownames: NULL
rowData names(0):
colnames(145): bima_A_ATAC bima_B_ATAC ... zaui_C_ATAC zaui_D_ATAC
colData names(47): sample_id donor ... short_long_ratio peak_count

We can access variables about the rows and columns with rowData and colData. There is also a shortcut for accessing variables about the columns, simply using a $ and the variable name:

head( se$sample_id )
[1] "bima_A_ATAC" "bima_B_ATAC" "bima_C_ATAC" "bima_D_ATAC" "eofe_A_ATAC"
[6] "eofe_B_ATAC"
# same as
head( colData(se)$sample_id )
[1] "bima_A_ATAC" "bima_B_ATAC" "bima_C_ATAC" "bima_D_ATAC" "eofe_A_ATAC"
[6] "eofe_B_ATAC"

Additionally, I will set an ordering of the conditions and define a new sample variable, condition:

fct_lvls <- c("naive","IFNg","SL1344","IFNg_SL1344")
se$condition <- factor( se$condition_name, fct_lvls )
table( se$condition )

      naive        IFNg      SL1344 IFNg_SL1344 
         42          41          31          31 

Another benefit of building such an object is that many R/Bioconductor packages take in such an annotated matrix object. The DESeq2 package allows us to quickly scale, transform, and plot the data with a few lines of code. We avoid potential issues with the tables getting out of order, as the entire object is passed from one line to the next (dds is just another type of SummarizedExperiment here).

library(DESeq2)
dds <- DESeqDataSet(se, ~condition)
dds <- estimateSizeFactors(dds)
vsd <- varianceStabilizingTransformation(dds) 
plotPCA(vsd, intgroup="condition")
using ntop=500 top features by variance

Question

What do you observe in the PCA plot colored by sample condition?

We actually have three aligned tables (remember, peaks and counts were already aligned by row). Let's add the peak data:

rowData(se) <- peaks |> 
  select(
    peak_id = gene_id, 
    length, 
    gc = percentage_gc_content)
rowData(se)
DataFrame with 1000 rows and 3 columns
            peak_id    length        gc
        <character> <numeric> <numeric>
1       ATAC_peak_1       690  0.555072
2       ATAC_peak_2       535  0.674766
3       ATAC_peak_3       225  0.617778
4       ATAC_peak_4       334  0.571856
5       ATAC_peak_5       204  0.558824
...             ...       ...       ...
996   ATAC_peak_996       158  0.588608
997   ATAC_peak_997       202  0.673267
998   ATAC_peak_998       138  0.615942
999   ATAC_peak_999       191  0.476440
1000 ATAC_peak_1000       360  0.536111

I will annotate the rows of the se object:

rownames(se) <- rowData(se)$peak_id

Next we will also add the particular information about the location of the peaks, to enable more sophisticated downstream EDA.

Case study: chromatin accessibility

We will look for regions that have highly varying accessibility across the samples, using the PCA loadings.

pca <- prcomp(t(assay(vsd)))
plot(pca$x[,1:2], col=vsd$condition)

rot <- pca$rotation[,1:4] |> as_tibble()
rot |> 
  arrange(desc(abs(PC1))) |>
  head()
# A tibble: 6 × 4
    PC1     PC2      PC3       PC4
  <dbl>   <dbl>    <dbl>     <dbl>
1 0.144 -0.0578  0.00976 -0.0276  
2 0.129 -0.0320  0.0112  -0.00507 
3 0.129 -0.0476  0.0158  -0.000589
4 0.111 -0.0326  0.0130  -0.0412  
5 0.110 -0.0573  0.00471 -0.00106 
6 0.108 -0.0362 -0.00587  0.0391  
pc1_features <- order(abs(rot$PC1), decreasing=TRUE) |> head(10)
par(mfrow = c(2, 2), mar = c(4.5, 4.5, .5, .5))
for (i in c(1,3,5,7)) {
  plotCounts(
    dds,
    gene = pc1_features[i],
    intgroup = "condition",
    ylim=c(1, 1000)
  )
}

set.seed(5)
par(mfrow = c(3, 3), mar = c(4.5, 4.5, .5, .5))
for (i in sample(1000, 9)) {
  plotCounts(
    dds,
    gene = i,
    intgroup = "condition",
    ylim = c(1, 1000)
  )
}

library(plyranges)
library(nullranges)
p <- peaks |>
  as_granges(seqnames = chr)
p
GRanges object with 1000 ranges and 5 metadata columns:
         seqnames          ranges strand |        gene_id    length
            <Rle>       <IRanges>  <Rle> |    <character> <numeric>
     [1]        1      9979-10668      + |    ATAC_peak_1       690
     [2]        1     10939-11473      + |    ATAC_peak_2       535
     [3]        1     15505-15729      + |    ATAC_peak_3       225
     [4]        1     21148-21481      + |    ATAC_peak_4       334
     [5]        1     21864-22067      + |    ATAC_peak_5       204
     ...      ...             ...    ... .            ...       ...
   [996]        1 8328316-8328473      + |  ATAC_peak_996       158
   [997]        1 8349052-8349253      + |  ATAC_peak_997       202
   [998]        1 8351796-8351933      + |  ATAC_peak_998       138
   [999]        1 8371011-8371201      + |  ATAC_peak_999       191
  [1000]        1 8373970-8374329      + | ATAC_peak_1000       360
         percentage_gc_content     score      gene_name
                     <numeric> <numeric>    <character>
     [1]              0.555072      1000    ATAC_peak_1
     [2]              0.674766      1000    ATAC_peak_2
     [3]              0.617778      1000    ATAC_peak_3
     [4]              0.571856      1000    ATAC_peak_4
     [5]              0.558824      1000    ATAC_peak_5
     ...                   ...       ...            ...
   [996]              0.588608      1000  ATAC_peak_996
   [997]              0.673267      1000  ATAC_peak_997
   [998]              0.615942      1000  ATAC_peak_998
   [999]              0.476440      1000  ATAC_peak_999
  [1000]              0.536111      1000 ATAC_peak_1000
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
p <- p |>
  mutate(
    pc1 = rot[, "PC1", drop = TRUE],
    abs_scaled_pc1 = abs(pc1)/max(abs(pc1))
  )
plot(p$abs_scaled_pc1)

tiles <- range(p) %>%
  tile_ranges(width=1e4) %>%
  mutate(id = seq_along(.))
tab <- tiles |> 
  join_overlap_left(p) |>
  group_by(id) |>
  summarize(sum_pc1 = sum(abs_scaled_pc1)) |>
  as_tibble() |>
  mutate(sum_pc1 = tidyr::replace_na(sum_pc1, 0))
plot(tab$sum_pc1)

which.max(tab$sum_pc1)
[1] 106
tab |> slice(which.max(sum_pc1))
# A tibble: 1 × 2
     id sum_pc1
  <int>   <dbl>
1   106    3.31
max_tile <- tiles |> slice(which.max(tab$sum_pc1))
p |>
  filter_by_overlaps(max_tile)
GRanges object with 8 ranges and 7 metadata columns:
      seqnames          ranges strand |       gene_id    length
         <Rle>       <IRanges>  <Rle> |   <character> <numeric>
  [1]        1 1059227-1059777      + | ATAC_peak_157       551
  [2]        1 1059840-1059926      + | ATAC_peak_158        87
  [3]        1 1063910-1064089      + | ATAC_peak_159       180
  [4]        1 1064187-1064553      + | ATAC_peak_160       367
  [5]        1 1065904-1066081      + | ATAC_peak_161       178
  [6]        1 1066809-1066992      + | ATAC_peak_162       184
  [7]        1 1067064-1067251      + | ATAC_peak_163       188
  [8]        1 1069109-1069645      + | ATAC_peak_164       537
      percentage_gc_content     score     gene_name       pc1 abs_scaled_pc1
                  <numeric> <numeric>   <character> <numeric>      <numeric>
  [1]              0.771325      1000 ATAC_peak_157 0.0308880       0.214616
  [2]              0.735632      1000 ATAC_peak_158 0.0243494       0.169184
  [3]              0.744444      1000 ATAC_peak_159 0.0428207       0.297526
  [4]              0.686648      1000 ATAC_peak_160 0.0962591       0.668826
  [5]              0.528090      1000 ATAC_peak_161 0.0401471       0.278949
  [6]              0.657609      1000 ATAC_peak_162 0.1046371       0.727038
  [7]              0.510638      1000 ATAC_peak_163 0.0923891       0.641936
  [8]              0.731844      1000 ATAC_peak_164 0.0444586       0.308907
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

How can we tell whether the pattern in this tile is unexpected? One approach is to bootstrap the ranges, using bootRanges() from the nullranges package:

seqlengths(p) <- c("1" = 9e6) # just for demo
boot <- bootRanges(p, blockLength = 1e5, R = 10)
boot |>
  filter_by_overlaps(max_tile) |>
  group_by(iter) |>
  summarize(
    num_overlaps = n(),
    sum_pc1 = sum(abs_scaled_pc1)
  )
DataFrame with 5 rows and 3 columns
   iter num_overlaps   sum_pc1
  <Rle>    <integer> <numeric>
1     1            4 0.7936053
2     3            4 0.1728121
3     5            1 0.0267067
4     6            2 0.1743093
5     9            1 0.5201442

Appendix

Tidy operations on rich objects

The following code chunks show how we can use new packages from the tidyomics project (within Bioconductor) to manipulate objects using dplyr-like syntax.

library(SummarizedExperiment)
se_thin <- se
cols <- c("donor","condition","mean_purity","assigned","peak_count")
colData(se_thin) <- colData(se_thin)[,cols]
library(plyxp)
xp <- new_plyxp(se_thin) |>
  mutate(
    log_scaled_counts = log2(counts/.cols$assigned + 1),
    rows(row_sums = rowSums(.assays_asis$counts))
  )
library(tidySummarizedExperiment)
xp |>
  filter(rows(peak_id == "ATAC_peak_162")) |>
  se() |>
  ggplot(aes(condition, log_scaled_counts)) + 
  geom_point()
Warning in check_se_dimnames(se): tidySummarizedExperiment says: at least one
of the assays in your SummarizedExperiment have row names, but they don't
completely overlap between assays. It is strongly recommended to make the
assays consistent, to avoid erroneous matching of features.

iSEE - interactive exploratory applications

iSEE is a Bioconductor package (with many associated packages) that allows for interactive visualization of data objects. You can try it out with the following:

library(iSEE)
iSEE(se_thin)

Missing data

Identifying and characterizing missingness is important! The naniar package helps to visualize and handle patterns of missingness. As our worked example dataset above doesn't contain many missing values, here I simulate a dataset with missingness.

library(naniar)
dat_with_na <- tibble(
  foo = rnbinom(500, mu=20, size=5),
  bar = rnbinom(500, mu=20, size=5)
)
dat_with_na <- dat_with_na |>
  mutate(across(
    everything(),
    ~ {
      na_idx <- sample(length(.x), 100)
      .x[na_idx] <- NA
      .x
    }))
library(ggplot2)
ggplot(dat_with_na, aes(foo, bar)) + 
  geom_miss_point()

dat_with_na |>
  bind_shadow()
# A tibble: 500 × 4
     foo   bar foo_NA bar_NA
   <dbl> <dbl> <fct>  <fct> 
 1     8    17 !NA    !NA   
 2    26    21 !NA    !NA   
 3    16    12 !NA    !NA   
 4    NA    12 NA     !NA   
 5    32    NA !NA    NA    
 6    10    35 !NA    !NA   
 7    13    NA !NA    NA    
 8    60    17 !NA    !NA   
 9    33     7 !NA    !NA   
10    22    11 !NA    !NA   
# ℹ 490 more rows
dat_with_na |>
  bind_shadow() |>
  ggplot(aes(foo, color = bar_NA)) +
  geom_density()
Warning: Removed 100 rows containing non-finite outside the scale range
(`stat_density()`).

summarytools

The summarytools package provides alternative summaries, a bit more verbose than the ones above.

library(summarytools)
descr(peaks) # simple table, like skim
dfSummary(peaks) # pop-up descriptive page

It also allows for pop-out summary tables, within an RStudio session:

view(dfSummary(peaks)) # RStudio

Session info

sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.7.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Uzhgorod
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] naniar_1.1.0                    tidySummarizedExperiment_1.18.1
 [3] ttservice_0.5.3                 plyxp_1.2.0                    
 [5] nullranges_1.14.0               plyranges_1.28.0               
 [7] DESeq2_1.48.1                   SummarizedExperiment_1.38.1    
 [9] Biobase_2.68.0                  GenomicRanges_1.60.0           
[11] GenomeInfoDb_1.44.0             IRanges_2.42.0                 
[13] S4Vectors_0.46.0                BiocGenerics_0.54.0            
[15] generics_0.1.4                  MatrixGenerics_1.20.0          
[17] matrixStats_1.5.0               tidyr_1.3.1                    
[19] DataExplorer_0.8.3              ggplot2_3.5.2                  
[21] skimr_2.1.5                     dplyr_1.1.4                    
[23] readr_2.1.5                     here_1.0.1                     
[25] tibble_3.3.0                   

loaded via a namespace (and not attached):
 [1] bitops_1.0-9             gridExtra_2.3            rlang_1.1.6             
 [4] magrittr_2.0.3           ggridges_0.5.6           compiler_4.5.1          
 [7] mgcv_1.9-3               vctrs_0.6.5              reshape2_1.4.4          
[10] stringr_1.5.1            pkgconfig_2.0.3          crayon_1.5.3            
[13] fastmap_1.2.0            ellipsis_0.3.2           XVector_0.48.0          
[16] labeling_0.4.3           utf8_1.2.6               Rsamtools_2.24.0        
[19] rmarkdown_2.29           tzdb_0.5.0               UCSC.utils_1.4.0        
[22] visdat_0.6.0             purrr_1.1.0              bit_4.6.0               
[25] xfun_0.52                jsonlite_2.0.0           DelayedArray_0.34.1     
[28] BiocParallel_1.42.1      parallel_4.5.1           data.tree_1.1.0         
[31] R6_2.6.1                 stringi_1.8.7            RColorBrewer_1.1-3      
[34] rtracklayer_1.68.0       Rcpp_1.1.0               knitr_1.50              
[37] base64enc_0.1-3          Matrix_1.7-3             splines_4.5.1           
[40] igraph_2.1.4             tidyselect_1.2.1         abind_1.4-8             
[43] yaml_2.3.10              codetools_0.2-20         curl_6.4.0              
[46] lattice_0.22-7           plyr_1.8.9               InteractionSet_1.36.1   
[49] withr_3.0.2              S7_0.2.0                 evaluate_1.0.4          
[52] Biostrings_2.76.0        pillar_1.11.0            plotly_4.11.0           
[55] vroom_1.6.5              rprojroot_2.1.0          RCurl_1.98-1.17         
[58] hms_1.1.3                scales_1.4.0             glue_1.8.0              
[61] lazyeval_0.2.2           tools_4.5.1              BiocIO_1.18.0           
[64] data.table_1.17.8        locfit_1.5-9.12          GenomicAlignments_1.44.0
[67] XML_3.99-0.18            grid_4.5.1               colorspace_2.1-1        
[70] nlme_3.1-168             networkD3_0.4.1          GenomeInfoDbData_1.2.14 
[73] repr_1.1.7               restfulr_0.0.16          cli_3.6.5               
[76] fansi_1.0.6              viridisLite_0.4.2        S4Arrays_1.8.1          
[79] gtable_0.3.6             digest_0.6.37            SparseArray_1.8.0       
[82] rjson_0.2.23             htmlwidgets_1.6.4        farver_2.1.2            
[85] htmltools_0.5.8.1        lifecycle_1.0.4          httr_1.4.7              
[88] bit64_4.6.0-1           

Reuse

CC-BY